City College of San Francisco
MATH 108 - Foundations of Data Science
Associated Textbook Sections: 7.0, 7.1
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
Load the actors.csv data. The 'Total Gross', 'Average per Movie', and 'Gross' values represent Thousands of Dollars
actors = Table().read_table('./data/actors.csv')
actors
| Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
|---|---|---|---|---|---|
| Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7 |
| Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
| Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
| Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415 |
| Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4 |
| Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2 |
| Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3 |
| Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3 |
| Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9 |
| Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4 |
... (40 rows omitted)
The actor's name is a categorical attribute.
# use type( ) to check the type of value
type(actors.column('Actor').item(0))
str
The total gross dollar is a numerical attribute.
type(actors.column('Total Gross').item(0))
float
datascience library.'$12.00' is a str and likely to refelect a numerical attribute.Histograms tbl.hist are a standard way to visualize the distribution of one numerical variable.
Histograms will be focused on in the next lecture.
# tablename.hist('columnname', unit = " ") to create a histogram
actors.hist('Total Gross', unit="Thousands of Dollars")
# Some extra graph formatting you are not responsible for
plots.title('Distribution of Total Gross')
plots.show()
Line graphs tbl.plot and Scatter plots tbl.scatter are standard ways to visualize the relationship of two numerical variables.
# using .group to count the number of movies released each year
top_movies = Table.read_table('./data/top_movies_2023.csv')
movies_per_year = top_movies.group('Year').relabeled('count', 'Number of Movies')
# movies_per_year.show()
# tablename.plot( 'x-axis', 'y-axis')
movies_per_year.where('Year', are.above(1999)).plot('Year', 'Number of Movies')
plots.xticks(np.arange(2000, 2023, 5))
plots.title('Number of Movies vs. Release Year')
plots.show()
# scatter plot tablename.scatter('x-axis', 'y-axis')
actors.scatter('Number of Movies', 'Average per Movie')
plots.title('Average Pay per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()
Explore the US Census data from the Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.
(Release date: June 2021, Updated January 2022 to include April 1, 2020 estimates)
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full = Table.read_table(url)
full
| SEX | AGE | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | POPESTIMATE2019 | POPESTIMATE2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3944153 | 3944160 | 3951495 | 3963264 | 3926731 | 3931411 | 3954973 | 3984144 | 3963268 | 3882437 | 3826908 | 3762227 | 3735010 |
| 0 | 1 | 3978070 | 3978090 | 3957904 | 3966768 | 3978210 | 3943348 | 3949559 | 3973828 | 4003586 | 3981864 | 3897917 | 3842257 | 3773884 |
| 0 | 2 | 4096929 | 4096939 | 4090799 | 3971498 | 3980139 | 3993047 | 3960015 | 3967672 | 3992657 | 4021261 | 3996742 | 3911822 | 3853025 |
| 0 | 3 | 4119040 | 4119051 | 4111869 | 4102429 | 3983007 | 3992839 | 4007852 | 3976277 | 3984985 | 4009060 | 4035053 | 4009037 | 3921526 |
| 0 | 4 | 4063170 | 4063186 | 4077511 | 4122252 | 4112849 | 3994539 | 4006407 | 4022785 | 3992241 | 4000394 | 4021907 | 4045996 | 4017847 |
| 0 | 5 | 4056858 | 4056872 | 4064653 | 4087770 | 4132349 | 4123745 | 4007123 | 4020489 | 4038022 | 4007233 | 4012789 | 4032231 | 4054336 |
| 0 | 6 | 4066381 | 4066412 | 4073031 | 4075153 | 4097860 | 4142923 | 4135738 | 4020428 | 4034969 | 4052428 | 4019106 | 4022432 | 4040169 |
| 0 | 7 | 4030579 | 4030594 | 4043100 | 4083399 | 4085255 | 4108453 | 4154947 | 4148711 | 4034355 | 4048430 | 4063647 | 4027876 | 4029753 |
| 0 | 8 | 4046486 | 4046497 | 4025624 | 4053313 | 4093553 | 4096033 | 4120476 | 4167765 | 4162142 | 4047130 | 4059209 | 4071894 | 4034785 |
| 0 | 9 | 4148353 | 4148369 | 4125413 | 4035854 | 4063662 | 4104437 | 4107986 | 4133426 | 4181069 | 4175085 | 4058207 | 4067320 | 4078668 |
... (296 rows omitted)
In the previous lecture, we did the following:
SEX, AGE, CENSUS2010POP, and POPESTIMATE2019 columns.SEX value is 0. Drop the SEX column since there is only one value there.partial = full.select('SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2019')
simple = partial.relabeled(2, '2010').relabeled(3, '2019')
no_999 = simple.where('AGE', are.below(999))
everyone = no_999.where('SEX', 0).drop('SEX')
everyone
| AGE | 2010 | 2019 |
|---|---|---|
| 0 | 3944153 | 3762227 |
| 1 | 3978070 | 3842257 |
| 2 | 4096929 | 3911822 |
| 3 | 4119040 | 4009037 |
| 4 | 4063170 | 4045996 |
| 5 | 4056858 | 4032231 |
| 6 | 4066381 | 4022432 |
| 7 | 4030579 | 4027876 |
| 8 | 4046486 | 4071894 |
| 9 | 4148353 | 4067320 |
... (91 rows omitted)
Visualize the relationship between age and population size in 2010.
#everyone.plot('AGE', '2010')
everyone.relabeled('2010', 'count').plot('AGE', 'count')
plots.title('US Population Size in 2010')
plots.show()
Include lines for both 2010 and the estimated 2019 population sizes.
# leave off 2010 label and it will plot for each column
everyone.plot('AGE')
plots.title('US Population Size')
plots.show()
Create a table with Age, Males, Females columns showing the population estimates in 2019 for males and females by age.
# first, delete female rows to create males table
# then, delete male rows to create females table (each of these will have one row for each age)
# using the age column from the males table (or the females table), then paste on male and female counts
males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')
pop_2019 = Table().with_columns(
'Age', males.column('AGE'),
'Males', males.column('2019'),
'Females', females.column('2019')
)
pop_2019
| Age | Males | Females |
|---|---|---|
| 0 | 1921001 | 1841226 |
| 1 | 1963261 | 1878996 |
| 2 | 2000102 | 1911720 |
| 3 | 2048651 | 1960386 |
| 4 | 2068251 | 1977745 |
| 5 | 2063176 | 1969055 |
| 6 | 2055583 | 1966849 |
| 7 | 2058425 | 1969451 |
| 8 | 2082403 | 1989491 |
| 9 | 2075719 | 1991601 |
... (91 rows omitted)
Visualize the distribution of of population size for both males and females.
pop_2019.plot('Age')
plots.title('2019 Population Size Estimates')
plots.show()
Calculate the percent female for each age
# total number of people = males.column('2019') + females.column('2019')
pct_female = (females.column('2019') / (males.column('2019') + females.column('2019'))) * 100
pct_female
array([ 48.93979018, 48.90344399, 48.87032181, 48.89917454,
48.88153622, 48.83289177, 48.89701056, 48.89552211,
48.85910586, 48.96592842, 48.98425388, 48.96313718,
48.91848904, 48.91588355, 48.95682562, 48.99213593,
49.00723665, 48.9917086 , 48.94499775, 48.85555766,
48.8800806 , 48.89699809, 48.95129043, 48.84655675,
48.77220901, 48.76311842, 48.68996749, 48.84567382,
49.115004 , 49.23311185, 49.27161137, 49.33570713,
49.34690992, 49.39653681, 49.57328862, 49.7823678 ,
49.88801204, 49.99258886, 50.08019625, 49.89892133,
50.1409379 , 50.20977831, 50.37327215, 50.36508359,
50.27570341, 50.48253869, 50.64261911, 50.57544456,
50.61870656, 50.44489454, 50.56911629, 50.63449931,
50.80649435, 50.81894266, 50.89138769, 51.13627062,
51.2696241 , 51.37238838, 51.53410868, 51.46437873,
51.72648051, 51.88456258, 52.09723728, 52.31329221,
52.44314993, 52.76149769, 52.92230043, 53.03484444,
53.26468499, 53.27081102, 53.40722561, 53.44223716,
53.51022877, 53.9509406 , 54.25448772, 54.58073446,
54.83251151, 55.26819948, 55.82854715, 56.17047137,
56.3748233 , 57.03744511, 57.64539476, 58.2875019 ,
59.12037315, 59.77448788, 60.61994754, 61.50555207,
62.43469375, 63.42875214, 64.36264302, 65.56129226,
66.59478489, 67.76493653, 69.03326813, 70.06426052,
70.77789932, 72.11473518, 72.70429851, 74.48479938, 76.57254933])
Round the values to 3 decimal places so that it's easier to read.
pct_female = np.round(pct_female, 2)
pct_female
array([ 48.94, 48.9 , 48.87, 48.9 , 48.88, 48.83, 48.9 , 48.9 ,
48.86, 48.97, 48.98, 48.96, 48.92, 48.92, 48.96, 48.99,
49.01, 48.99, 48.94, 48.86, 48.88, 48.9 , 48.95, 48.85,
48.77, 48.76, 48.69, 48.85, 49.12, 49.23, 49.27, 49.34,
49.35, 49.4 , 49.57, 49.78, 49.89, 49.99, 50.08, 49.9 ,
50.14, 50.21, 50.37, 50.36, 50.28, 50.48, 50.64, 50.58,
50.62, 50.44, 50.57, 50.63, 50.81, 50.82, 50.89, 51.14,
51.27, 51.37, 51.53, 51.46, 51.73, 51.88, 52.1 , 52.31,
52.44, 52.76, 52.92, 53.04, 53.26, 53.27, 53.41, 53.44,
53.51, 53.95, 54.25, 54.58, 54.83, 55.27, 55.83, 56.17,
56.38, 57.04, 57.64, 58.29, 59.12, 59.77, 60.62, 61.51,
62.44, 63.43, 64.36, 65.56, 66.6 , 67.76, 69.03, 70.06,
70.78, 72.11, 72.7 , 74.48, 76.57])
Add female percent to our table
pop_2019 = pop_2019.with_column('Percent Female', pct_female)
pop_2019
| Age | Males | Females | Percent Female |
|---|---|---|---|
| 0 | 1921001 | 1841226 | 48.94 |
| 1 | 1963261 | 1878996 | 48.9 |
| 2 | 2000102 | 1911720 | 48.87 |
| 3 | 2048651 | 1960386 | 48.9 |
| 4 | 2068251 | 1977745 | 48.88 |
| 5 | 2063176 | 1969055 | 48.83 |
| 6 | 2055583 | 1966849 | 48.9 |
| 7 | 2058425 | 1969451 | 48.9 |
| 8 | 2082403 | 1989491 | 48.86 |
| 9 | 2075719 | 1991601 | 48.97 |
... (91 rows omitted)
Visualize the relationship between age and the percent of the population that is female.
pop_2019.plot('Age', 'Percent Female')
plots.title('Female Population Percentage over Age')
plots.show()
Be careful of being visually mislead by the y-axis.
pop_2019.plot('Age', 'Percent Female')
plots.ylim(0, 100);
plots.title('Female Population Percentage over Age')
plots.show()
Visualize the relationship between the number of movies and the average pay per movie for each actor in the dataset.
# pull up actors data set again
# use .scatter
actors.scatter('Number of Movies', 'Average per Movie')
plots.title('Average per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()
Identify the outlier in the dataset.
# trying to find the movie with the average per movie above 400
# actors.sort('Average per Movie', descending=True)
# or
actors.where('Average per Movie', are.above(400))
| Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross |
|---|---|---|---|---|---|
| Anthony Daniels | 3162.9 | 7 | 451.8 | Star Wars: The Force Awakens | 936.7 |
# to find the value
max(actors.column('Average per Movie'))
451.80000000000001
# to find actor
max_ave = max(actors.column('Average per Movie'))
actors.where('Average per Movie', max_ave).column('Actor').item(0)
'Anthony Daniels'
For all the visualization methods we use from the datascience library, if you put an i infront of the name of the visualization, you can access an interactive version of plot that is based on another visualization library called Plotly. You will not be tested on your knowledge of these interactive plots. You might find them helpful for exploring the data.
actors.iscatter('Number of Movies',
'Average per Movie',
labels='Actor',
title='Average per Movie (Thousands of Dollars) vs. Number of Movies')
barh are a standard way to visualize the distribution of a single categorical variable.cones = Table().read_table('./data/cones.csv')
cones_grouped_by_flavor = cones.group('Flavor')
cones_grouped_by_flavor.barh('Flavor')
plots.title('Distrubtion of Ice Cream Flavors')
plots.show()
The dataset top_movies_2023.csv shows the highest 1,000 grossing movies world wide listed on IMDB. Adjusted total gross values were also provided for data before 2021 using the Consumer Price Index (CPI)-based Python library cpi.
top_movies
| Created | Modified | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors | Gross | Gross (Adjusted) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-01-06 | 2023-01-06 | Gone with the Wind | https://www.imdb.com/title/tt0031381/ | movie | 8.2 | 238 | 1939 | Drama, Romance, War | 318271 | 1939-12-15 | Sam Wood, George Cukor, Victor Fleming | 402382193 | 7.84414e+09 |
| 2023-01-06 | 2023-01-06 | Bambi | https://www.imdb.com/title/tt0034492/ | movie | 7.3 | 69 | 1942 | Animation, Adventure, Drama, Family | 145676 | 1942-08-09 | Samuel Armstrong, Paul Satterfield, Graham Heid, James A ... | 267447150 | 4.44602e+09 |
| 2023-01-06 | 2023-01-06 | Titanic | https://www.imdb.com/title/tt0120338/ | movie | 7.9 | 194 | 1997 | Drama, Romance | 1187108 | 1997-11-01 | James Cameron | 2201647264 | 3.71701e+09 |
| 2023-01-06 | 2023-01-06 | Avatar | https://www.imdb.com/title/tt0499549/ | movie | 7.9 | 162 | 2009 | Action, Adventure, Fantasy, Sci-Fi | 1318546 | 2009-12-10 | James Cameron | 2922917914 | 3.69178e+09 |
| 2023-01-06 | 2023-01-06 | Snow White and the Seven Dwarfs | https://www.imdb.com/title/tt0029583/ | movie | 7.6 | 83 | 1937 | Animation, Adventure, Family, Fantasy, Musical, Romance | 202792 | 1937-12-21 | William Cottrell, Ben Sharpsteen, David Hand, Perce Pear ... | 184925486 | 3.47981e+09 |
| 2023-01-06 | 2023-01-06 | Star Wars | https://www.imdb.com/title/tt0076759/ | movie | 8.6 | 121 | 1977 | Action, Adventure, Fantasy, Sci-Fi | 1372821 | 1977-05-25 | George Lucas | 775398007 | 3.46716e+09 |
| 2023-01-06 | 2023-01-06 | Avengers: Endgame | https://www.imdb.com/title/tt4154796/ | movie | 8.4 | 181 | 2019 | Action, Adventure, Drama, Sci-Fi | 1144892 | 2019-04-22 | Anthony Russo, Joe Russo | 2797501328 | 2.96506e+09 |
| 2023-01-06 | 2023-01-06 | The Exorcist | https://www.imdb.com/title/tt0070047/ | movie | 8.1 | 122 | 1973 | Horror | 413376 | 1973-12-26 | William Friedkin | 441306145 | 2.69326e+09 |
| 2023-01-06 | 2023-01-06 | Jaws | https://www.imdb.com/title/tt0073195/ | movie | 8.1 | 124 | 1975 | Adventure, Thriller | 612946 | 1975-06-20 | Steven Spielberg | 476512065 | 2.40001e+09 |
| 2023-01-06 | 2023-01-06 | Star Wars: Episode VII - The Force Awakens | https://www.imdb.com/title/tt2488496/ | movie | 7.8 | 138 | 2015 | Action, Adventure, Sci-Fi | 936837 | 2015-12-14 | J.J. Abrams | 2069521700 | 2.36598e+09 |
... (990 rows omitted)
Since Gone with the Wind has been re-released several times, the adjusted price is not the most honest representation of its adjusted gross proces. For a more comparable analysis, reduce the table to the top top 10 movies based on actual gross values ('Gross (Adjusted)') for the movies releasted in the last decade.
# pare down the columns
top_movies_select = top_movies.select('Title', 'Year', 'Gross (Adjusted)')
# remove all movies from before 2012
top_movies_last_decade = top_movies_select.where('Year', are.above(2011))
# sort by gross (adjusted)
top_movies_last_decade_sorted = top_movies_last_decade.sort('Gross (Adjusted)', True)
# table method .take( ) with a range inside
top10 = top_movies_last_decade_sorted.take(np.arange(10))
top10
| Title | Year | Gross (Adjusted) |
|---|---|---|
| Avengers: Endgame | 2019 | 2.96506e+09 |
| Star Wars: Episode VII - The Force Awakens | 2015 | 2.36598e+09 |
| Avengers: Infinity War | 2018 | 2.21039e+09 |
| Spider-Man: No Way Home | 2021 | 1.91631e+09 |
| Jurassic World | 2015 | 1.91099e+09 |
| The Avengers | 2012 | 1.79253e+09 |
| The Lion King | 2019 | 1.76269e+09 |
| Fast & Furious 7 | 2015 | 1.73242e+09 |
| Avengers: Age of Ultron | 2015 | 1.60376e+09 |
| Frozen II | 2019 | 1.53688e+09 |
Convert to the gross (adjusted) values to billions of dollars for readability.
# take Gross (adjusted) and divide by 1 bill and round to 2 dps
billions = np.round(top10.column('Gross (Adjusted)') / 1000000000, 2)
# add the column to top10
top10 = top10.with_column('Gross Adjusted (Billions)', billions).select(0, 1, 3)
top10
| Title | Year | Gross Adjusted (Billions) |
|---|---|---|
| Avengers: Endgame | 2019 | 2.97 |
| Star Wars: Episode VII - The Force Awakens | 2015 | 2.37 |
| Avengers: Infinity War | 2018 | 2.21 |
| Spider-Man: No Way Home | 2021 | 1.92 |
| Jurassic World | 2015 | 1.91 |
| The Avengers | 2012 | 1.79 |
| The Lion King | 2019 | 1.76 |
| Fast & Furious 7 | 2015 | 1.73 |
| Avengers: Age of Ultron | 2015 | 1.6 |
| Frozen II | 2019 | 1.54 |
Visualize the gross adjusted values for each of the top 10 grossing (adjusted) movies.
# use .barh
top10.barh('Title', 'Gross Adjusted (Billions)')
plots.title("The Top 10 Grossing Movies")
plots.show()
From Nathan Yau’s Data Points: Visualization that Means Something, our eyes can extract information at different levels of accuracy depending on the design.